1. Data visualisation principles

Minimize noise and maximize signal in your graphs (put another way: maximize the data-ink ratio):

source: Darkhorse Analytics

  • Avoid chart junk.
  • Choose the type of plot depending on the type of data.
  • Label chart elements properly and informatively.
  • Ideally, both the x and y axes start at 0 (scales can be really deceiving otherwise).
  • Use consistent units! (Do not mix yearly and monthly GDP, for example.)
  • ABSOLUTELY NO 3D PIE CHARTS. (When someone does 3D pie charts God makes a kitten cry.)

Me, seeing 3D charts (I am triggered equally no matter the subgenre):

Resources:

  • Some examples in this workshop are adapted from the great Data Visualization: A Practical Introduction by Kieran Healy.
  • More on dataviz theory and best practice: Fundamentals of Data Visualization by Claus O. Wilke.
  • ggplot2 cheat sheet
  • List of ggplot2 extensions
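
As a small illustration of the zero-baseline principle from the list above, here is a minimal sketch. The sales data frame and its numbers are made up for this example and are not part of the workshop data; expand_limits() is the ggplot2 helper that widens an axis so that it includes a given value.

```r
library(ggplot2)

# Made-up yearly figures, used only for illustration
sales <- data.frame(year = 2015:2020,
                    revenue = c(95, 97, 98, 101, 103, 104))

# By default the y axis only covers the observed values,
# which makes a small change look dramatic
p_truncated <- ggplot(sales, aes(x = year, y = revenue)) +
    geom_line()

# expand_limits() widens the y axis so that it includes 0
p_zero <- p_truncated +
    expand_limits(y = 0)

# Smallest value covered by the y scale in each version
y_min_truncated <- layer_scales(p_truncated)$y$range$range[1]
y_min_zero <- layer_scales(p_zero)$y$range$range[1]
c(truncated = y_min_truncated, zero_based = y_min_zero)
```

Printing p_truncated and p_zero side by side shows how differently the same six numbers can read. Whether a zero baseline is appropriate still depends on the data, so treat this as a starting point rather than a rule.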

2. ggplot2 and its extensions

The gg in the name stands for grammar of graphics. It enables you to build your plot layer by layer while keeping control over every detail of the output (if you so wish). It is used by many in academia and by Financial Times and FiveThirtyEight writers, among many others. During this workshop we will go through various types of data visualisations and try to apply the principles set out above to our output.

We create plots with the syntax shown throughout this section. First, we load the packages and the data used in the workshop:

library(readr)
library(dplyr)
library(ggplot2)
library(ggridges)
library(ggthemes)
library(gapminder)
# data
iris_df <- iris

tarantino_538 <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/tarantino/tarantino.csv")

gapminder_df <- gapminder

stocks <- read_csv("data/stock_data.csv")

2.0 building our first ggplot

Let’s create the foundation of our plot by telling ggplot() which data we use and which variable we want to plot.

p_hist <- ggplot(data = gapminder_df,
                 mapping = aes(x = gdpPercap))

# what happens if we just use this?
p_hist

We need to specify what sort of shape should be used to display our data. We can do this by adding the geom_histogram() function with a +.

p_hist + 
    geom_histogram()
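
To see what the + actually does, we can count the layers of a plot object before and after adding a geom. This is a minimal sketch that uses the built-in iris data instead of gapminder, so that it runs without extra packages; the object names are made up for this example.

```r
library(ggplot2)

# Foundation only: data and mapping, but nothing to draw yet
p_empty <- ggplot(data = iris, mapping = aes(x = Sepal.Length))
n_layers_before <- length(p_empty$layers)

# Adding a geom with + appends one layer to the plot object
p_with_geom <- p_empty + geom_histogram(binwidth = 0.2)
n_layers_after <- length(p_with_geom$layers)

c(before = n_layers_before, after = n_layers_after)
```

A plot without layers still draws the axes and the background, which is why the bare object above shows an empty panel.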

Looks a little bit skewed. Let’s log transform our variable with the scale_x_log10() function.

p_hist + 
    geom_histogram() +
    scale_x_log10()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As the message says, we can mess around with the binwidth argument, so let’s do that.

p_hist + 
    geom_histogram(binwidth = 0.05) +
    scale_x_log10()
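
One thing that is easy to miss: when the axis is log-transformed, the binwidth is measured in log10 units, not in the original units of the variable. The sketch below checks this on simulated data (the incomes data frame is made up for illustration) by pulling the computed bins out of the plot with layer_data().

```r
library(ggplot2)

# Simulated, strongly skewed values between 100 and 100,000
set.seed(42)
incomes <- data.frame(income = 10^runif(500, min = 2, max = 5))

p_log <- ggplot(incomes, aes(x = income)) +
    geom_histogram(binwidth = 0.25) +
    scale_x_log10()

# layer_data() returns the bins as ggplot2 computed them;
# xmin and xmax are on the transformed (log10) scale
bins <- layer_data(p_log)
bin_widths <- bins$xmax - bins$xmin
range(bin_widths)
```

So a binwidth of 0.25 here means that every bin spans a quarter of an order of magnitude.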

2.1 scatter plot (geom_point())

We use a scatter plot to illustrate the association between two continuous variables. Usually, the y axis shows our dependent variable (the variable being explained) and the x axis shows the independent variable, which we suspect drives the association.

Now, we want to know what the association is between GDP per capita and life expectancy.

ggplot(gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp)) +
    geom_point()

Now that we have a basic figure, let’s make it better. We transform the x axis values with scale_x_log10() and add text to our plot with the labs() function. Within geom_point() we can also specify geom-specific options, such as the alpha level (transparency).

ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp)) +
    geom_point(alpha = 0.25) + # inside the geom_ we can modify its attributes. Here we set the transparency levels of the points
    scale_x_log10() + # rescale our x axis
    labs(x = "GDP per capita", 
         y = "Life expectancy",
         title = "Connection between GDP and Life expectancy",
         subtitle = "Points are country-years",
         caption = "Source: Gapminder")

To add some analytical power to our plot we can use geom_smooth() and choose a method for its smoothing function. It can be lm, glm, gam, loess, or rlm. We will use the linear model (“lm”). Note: this is purely for illustrative purposes. Our data points are country-years, so repeated observations of the same country are not independent, and “lm” is not a proper way to fit a regression line to this data.

ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp)) +
    geom_point(alpha = 0.25) + 
    scale_x_log10() +
    geom_smooth(method = "lm", se = TRUE, color = "orange") + # adding the regression line
    labs(x = "GDP per capita", 
         y = "Life expectancy",
         title = "Connection between GDP and Life expectancy",
         subtitle = "Points are country-years",
         caption = "Source: Gapminder")
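
It helps to know which model the orange line actually represents. Because the x axis is transformed before the smoothing is computed, the line corresponds to a regression of y on log10(x). The sketch below checks this with the built-in mtcars data (used here only so the example is self-contained) by comparing the line ggplot2 draws with a model fitted by hand.

```r
library(ggplot2)

p_fit <- ggplot(mtcars, aes(x = disp, y = mpg)) +
    geom_point(alpha = 0.25) +
    scale_x_log10() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The fitted line as drawn by ggplot2 (second layer);
# its x values are already on the log10 scale
smooth_line <- layer_data(p_fit, 2)

# The same model fitted by hand on the log10-transformed predictor
manual_fit <- lm(mpg ~ log10(disp), data = mtcars)
manual_pred <- predict(manual_fit,
                       newdata = data.frame(disp = 10^smooth_line$x))

# Largest difference between the two sets of fitted values
max_diff <- max(abs(manual_pred - smooth_line$y))
max_diff
```

The difference should be zero up to floating point error, which means the slope has to be read as the change in y per tenfold increase in x.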

What if we want to see how each continent fares in this relationship? We need to include a new argument in the aes() mapping: color =. Now it is clear that European countries (country-years) are clustered in the high-GDP, high-life-expectancy upper right corner.

ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp,
                           color = continent)) + # color by category
    geom_point(alpha = 0.5) + 
    scale_x_log10() + # rescale our x axis
    labs(x = "GDP per capita", 
         y = "Life expectancy",
         title = "Connection between GDP and Life expectancy",
         subtitle = "Points are country-years",
         caption = "Source: Gapminder")

We can add a horizontal or a vertical line to our plot if we have a particular cutoff that we want to show. We can add these with the geom_hline() and geom_vline() functions.

ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp,
                           color = continent)) + # color by category
    geom_point(alpha = 0.5) + 
    scale_x_log10() +
    geom_vline(xintercept = 3500) + # adding vertical line 
    geom_hline(yintercept = 70, linetype = "dashed", color = "black", linewidth = 1) + # adding horizontal line (linewidth replaces size for lines in ggplot2 >= 3.4.0)
    labs(x = "GDP per capita", 
         y = "Life expectancy",
         title = "Connection between GDP and Life expectancy",
         subtitle = "Points are country-years",
         caption = "Source: Gapminder")

2.2 histogram

We use histograms to check the distribution of our data, as we have seen in the intro.

ggplot(gapminder_df,
       mapping = aes(x = lifeExp)) +
    geom_histogram() 

ggplot(gapminder_df,
       mapping = aes(x = lifeExp)) +
    geom_histogram(binwidth = 1, color = "black", fill = "orange") # we can set the colors and border of the bars and set the binwidth or bins 

We can overlay more than one histogram on top of each other. See how the different iris species have different sepal length distributions.

ggplot(data = iris_df,
       mapping = aes(x = Sepal.Length,
                     fill = Species)) +
    geom_histogram(binwidth = 0.1, position = "identity", alpha = 0.65) # position = "identity" overlays the bars instead of stacking them, so all three species stay visible
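
To make the effect of the position option concrete, the following sketch builds the same histogram twice, once with the default stacking and once with position = "identity", and looks at where the bars start. The binwidth of 0.25 and the object names are choices made for this example only.

```r
library(ggplot2)

p_base <- ggplot(iris, aes(x = Sepal.Length, fill = Species))

# Default behaviour: bars of different species are stacked on each other
p_stack <- p_base +
    geom_histogram(binwidth = 0.25)

# position = "identity": every bar starts at zero, so the bars overlap
p_identity <- p_base +
    geom_histogram(binwidth = 0.25, position = "identity", alpha = 0.65)

# ymin is the lower edge of each bar in the built plot
stack_ymin <- layer_data(p_stack)$ymin
identity_ymin <- layer_data(p_identity)$ymin

c(stacked_bars_above_zero = sum(stack_ymin > 0),
  identity_bars_above_zero = sum(identity_ymin > 0))
```

With stacking, the height of a single species is hard to read because its bars float on top of the others. With overlapping bars we need some transparency, which is why alpha is set.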

2.3 density plots

A variation on the histogram is the density plot, which uses kernel smoothing (fancy! but in reality it is a smoothing function that uses weighted averages of neighboring data points).

ggplot(iris_df,
       mapping = aes(x = Sepal.Length)) +
    geom_density()
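
To get a feeling for what is being drawn, we can compute a kernel density estimate ourselves with base R's density() function. As far as I know, geom_density() relies on the same kind of estimate, so take this as a rough sketch of the idea rather than a description of ggplot2's internals. The check at the end confirms that the area under the curve is roughly 1, which is why the y axis shows density and not counts.

```r
# Kernel density estimate with the default Gaussian kernel and bandwidth
sepal_density <- density(iris$Sepal.Length)

# Approximate the area under the curve with the trapezoid rule
step_widths <- diff(sepal_density$x)
avg_heights <- (head(sepal_density$y, -1) + tail(sepal_density$y, -1)) / 2
area_under_curve <- sum(step_widths * avg_heights)

area_under_curve
```

A larger bandwidth (the bw argument of density()) gives a smoother curve, a smaller one follows the data more closely.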

Add some fill

ggplot(iris_df,
       mapping = aes(x = Sepal.Length)) +
    geom_density(fill = "orange", alpha = 0.3)

Your intuition is correct: we can overlay this on our histogram.

ggplot(iris_df,
       mapping = aes(x = Sepal.Length)) +
    geom_histogram(aes(y = ..density..), # we add this so the y axis is density instead of count
                   binwidth = 0.1,
                   fill = "white",
                   color = "black") +
    geom_density(alpha = 0.25, fill = "orange")

And similarly to the histogram, we can overlay two or more density plots as well.

ggplot(iris_df,
       mapping = aes(x = Sepal.Length,
                     fill = Species)) +
    geom_density(alpha = 0.5)

2.3.1 ridgeline/joyplot

This one looks quite spectacular and is informative too. It serves a similar purpose to the overlaid histograms but presents the data much more clearly. For this, we need the ggridges package, which is a ggplot2 extension.

ggplot(data = iris_df,
       mapping = aes(x = Sepal.Length,
                     y = Species,
                     fill = Species)) +
    geom_density_ridges(scale = 0.8, alpha = 0.5)

2.4 bar charts

We can use bar charts to visualise categorical data. Let’s prep some data.

tarantino_rip <- tarantino_538 %>% 
    filter(type == "death")

ggplot(data = tarantino_rip,
       aes(x = movie)) +
    geom_bar()

We can use the fill option to map another variable onto our plot. Let’s see how these categories are further divided by the type of event in the movies (profanity or death). By default we get a stacked bar chart.

ggplot(tarantino_538, aes(movie, fill = type)) +
    geom_bar()

We can use the position argument of geom_bar() to change this. Another neat trick to make our graph more readable is coord_flip().

ggplot(tarantino_538, aes(movie, fill = type)) +
    geom_bar(position = "dodge") +
    coord_flip()

Let’s make sure that the bars are proportional. For this we can use the y = ..prop.. and group = type arguments, so that the y axis is calculated as proportions within each type. ..prop.. is a temporary variable computed by the stat; it has the .. surrounding it so there is no collision with a variable named prop in our data.

ggplot(tarantino_538, aes(movie, fill = type)) +
    geom_bar(position = "dodge",
             aes(y = ..prop.., group = type)) +
    coord_flip()
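
A side note on the notation: in newer ggplot2 versions (3.3.0 and later) the same computed variable can be written as after_stat(prop), and the ..prop.. spelling triggers a deprecation warning from version 3.4.0 on. Below is a minimal sketch with the built-in mtcars data (the cars object is made up for this example) which also checks that the proportions add up to 1 within each group.

```r
library(ggplot2)

# Treat the number of cylinders and the transmission type as categories
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))

p_prop <- ggplot(cars, aes(x = cyl, fill = am)) +
    geom_bar(aes(y = after_stat(prop), group = am),
             position = "dodge")

# The computed proportions are part of the built layer data
props <- layer_data(p_prop)
prop_sums <- tapply(props$prop, props$group, sum)
prop_sums
```

Because of group = am, the proportions are calculated separately for each transmission type, just as group = type does for the Tarantino data above.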

Maybe it is best to facet by type.

ggplot(tarantino_538, aes(movie, fill = type)) +
    geom_bar(position = "dodge",
             aes(y = ..prop.., group = type)) +
    coord_flip() +
    facet_wrap(~type, ncol = 2)

2.4.1 Lollipop charts

The lollipop chart is a better bar chart in the sense that it conveys the same information with a better data-ink ratio. It also looks better. (Note: some still consider it a gimmick.)

For this we will modify a chart from the Data Visualization textbook.

# for the data see the github repository of the workshop

load("data/oecd_sum.rda") 

p <- ggplot(data = oecd_sum,
       mapping = aes(x = year, y = diff, color = hi_lo)) 


p + geom_segment(aes(y = 0, x = year, yend = diff, xend = year)) +
    geom_point() +
    theme(legend.position="none") +
    labs(x = NULL, y = "Difference in Years",
       title = "The US Life Expectancy Gap",
       subtitle = "Difference between US and OECD average life expectancies, 1960-2015",
       caption = "Adapted from Kieran Healy: Data Visualization, fig. 4.21")
#> Warning: Removed 1 rows containing missing values (geom_segment).
#> Warning: Removed 1 rows containing missing values (geom_point).

2.5 box plot

ggplot(data = iris_df,
       mapping = aes(x = Species,
                     y = Sepal.Length)) +
    geom_boxplot()

We can add color coding to our boxplots as well.


ggplot(data = iris_df,
       mapping = aes(x = Species,
                     y = Sepal.Length,
                     fill = Species)) +
    geom_boxplot(alpha = 0.5)

2.6 violin chart

ggplot(data = iris_df,
       mapping = aes(x = Species,
                     y = Sepal.Length)) +
    geom_violin()

2.7 line chart

For this we use data on stock closing prices. As we are now familiar with the ggplot2 syntax, I no longer write out data = and mapping = explicitly.

ggplot(stocks, aes(date, stock_closing, color = company)) +
    geom_line()

Add some refinements.

ggplot(stocks, aes(date, stock_closing, color = company)) +
    geom_line(linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
    labs(x = "", y = "Prices (USD)",
         title = "Closing daily prices for selected tech stocks",
         subtitle = "Data from 2016-01-10 to 2018-01-10",
         caption = "source: Yahoo Finance")

Faceting helps.

ggplot(stocks, aes(date, stock_closing, color = company)) +
    geom_line(linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
    labs(x = "", y = "Prices (USD)",
         title = "Closing daily prices for selected tech stocks",
         subtitle = "Data from 2016-01-10 to 2018-01-10",
         caption = "source: Yahoo Finance") +
    facet_wrap(~company, nrow = 4)
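
When the series live on very different scales, a shared y axis can flatten the smaller ones. In that case the scales argument of facet_wrap() may help. The sketch below uses the economics_long data that ships with ggplot2 (not the stock data of this workshop) purely as an illustration.

```r
library(ggplot2)

# Five economic indicators measured on very different scales
p_free <- ggplot(economics_long, aes(x = date, y = value)) +
    geom_line() +
    facet_wrap(~variable, ncol = 1, scales = "free_y")

# One panel per series
n_panels <- length(unique(layer_data(p_free)$PANEL))
n_panels
```

The price of free scales is that the panels can no longer be compared by eye, so it is worth saying so in the caption or the subtitle.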

3. Themes and plot elements

3.1 Themes

In this section we will go over some of the elements that you can modify in order to get an informative and nice-looking figure. ggplot2 comes with a number of themes. You can play around with these built-in themes, and you can also take a look at the ggthemes package, which includes the Economist theme (theme_economist()) among others. Another notable theme package is hrbrthemes.

Try out a couple to see how they differ! The ggthemes package has a nice collection of themes to use. The theme presets can be applied with the theme_*() functions.

ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp)) +
    geom_point(alpha = 0.25) + 
    scale_x_log10() + 
    theme_minimal() # adding our chosen theme

3.2 Plot elements

Of course we can set all elements to suit our need, without using someone else’s theme.

The key plot elements that we will look at are:

  • labels
  • gridlines
  • fonts
  • colors
  • legend
  • axis breaks

Adding labels, title, as we did before.

ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp,
                           color = continent)) +
    geom_point(alpha = 0.5) + 
    scale_x_log10() + 
    labs(x = "GDP per capita", 
         y = "Life expectancy",
         title = "Connection between GDP and Life expectancy",
         subtitle = "Points are country-years",
         caption = "Source: Gapminder",
         color = "Continent") # changing the legend title

Let’s use a different color scale! We can use a color brewer scale (widely used for data visualization).

ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp,
                           color = continent)) +
    geom_point(alpha = 0.5) + 
    scale_x_log10() + 
    scale_color_brewer(name = "Continent", palette = "Set1") # adding the color brewer color scale

Or we can define our own colors:


ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp,
                           color = continent)) +
    geom_point(alpha = 0.5) + 
    scale_x_log10() + 
    scale_color_manual(values=c("red", "blue", "orange", "black", "green")) # adding our manual color scale
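
If we pass the colors as an unnamed vector, they are matched to the categories by position. A slightly safer variant, sketched below on the iris data, is a named vector that ties each category to its color explicitly. The species_colors name and the color choices are made up for this example.

```r
library(ggplot2)

# Each name has to match a level of the variable mapped to color
species_colors <- c(setosa = "orange",
                    versicolor = "steelblue",
                    virginica = "black")

p_manual <- ggplot(iris, aes(x = Sepal.Length,
                             y = Sepal.Width,
                             color = Species)) +
    geom_point(alpha = 0.5) +
    scale_color_manual(values = species_colors)

# Colors actually assigned to the points in the built plot
pts <- layer_data(p_manual)
table(pts$colour)
```

This way the colors stay attached to the right category even if the order of the factor levels changes.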

To clean up clutter, we will remove the background and leave only some of the grid behind. We can hide the tick marks by setting axis.ticks to element_blank() inside the theme() function. Hiding gridlines also requires some digging in theme(), using the panel.grid.minor or panel.grid.major arguments. If you want to remove the gridlines of a certain axis only, you can specify, for example, panel.grid.major.x. We can also set the background to nothing. Furthermore, we can define the text attributes of our labels as well.

ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp,
                           color = continent)) +
    geom_point(alpha = 0.5) + 
    scale_x_log10() + 
    theme(axis.ticks = element_blank(), # removing axis ticks
          panel.grid.minor = element_blank(), 
          panel.background = element_blank()) # removing the background

Finally, let’s move the legend around. Or just remove it with theme(legend.position="none"). We also do not need the background of the legend, so remove it with legend.key, and play around with the text elements of the plot with text.


ggplot(data = gapminder_df,
             mapping = aes(x = gdpPercap,
                           y = lifeExp,
                           color = continent)) +
    geom_point(alpha = 0.5) + 
    scale_x_log10() + 
    theme(axis.ticks = element_blank(), # removing axis ticks
          panel.grid.minor = element_blank(), # removing the gridline
          panel.background = element_blank(), # removing the background
          legend.title = element_text(size = 12), # setting the legend title's text size
          text = element_text(face = "plain", family = "sans"), # setting global text options for our plot
          legend.key = element_blank(), # removing the background of the legend keys
          legend.position = "bottom") # moving the legend below the plot
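
If we find ourselves typing the same theme() settings again and again, we can store them in an object and add it to any plot. A minimal sketch follows; the workshop_theme name is made up for this example and the plot uses the iris data.

```r
library(ggplot2)

# Collect the settings once ...
workshop_theme <- theme(axis.ticks = element_blank(),
                        panel.grid.minor = element_blank(),
                        panel.background = element_blank(),
                        legend.key = element_blank(),
                        legend.position = "bottom")

# ... and reuse them like any other plot component
p_styled <- ggplot(iris, aes(x = Sepal.Length,
                             y = Sepal.Width,
                             color = Species)) +
    geom_point(alpha = 0.5) +
    workshop_theme

# The stored settings can be inspected like a list
workshop_theme$legend.position
```

Settings added after workshop_theme override it, so individual plots can still deviate where needed.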

While we are at it, we want to have labels for our data. For this, we’ll create a plot that can make use of them.

We use geom_text() to put our labels in the chart.

gapminder <- gapminder %>% 
    filter(year == 2002, continent == "Europe")


ggplot(gapminder, aes(lifeExp, gdpPercap, label = country)) + # we add the labels!
    geom_point() +
    geom_text() # and use the geom text

Notice the different outcome when we use geom_label() instead of geom_text().

ggplot(gapminder, aes(lifeExp, gdpPercap, label = country)) + # we add the labels!
    geom_point() +
    geom_label() # and use the geom label

If we want to label a specific set of countries we can do it from inside ggplot, without needing to touch our data.

ggplot(gapminder, aes(lifeExp, gdpPercap, label = country)) + # we add the labels!
    geom_point() +
    geom_text(aes(label = if_else(lifeExp > 80, as.character(country), NA_character_)), nudge_x = 0.5) # we add a conditional within the geom (NA means no label). Note the nudge_x!
#> Warning: Removed 26 rows containing missing values (geom_text).
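
The same idea works with base R's ifelse(), and setting na.rm = TRUE in the geom drops the unlabeled rows without a warning. Below is a minimal sketch with the built-in mtcars data; the cars data frame and the cutoff of 30 miles per gallon are choices made for this example only.

```r
library(ggplot2)

# Turn the row names into a regular column so we can use them as labels
cars <- data.frame(model = rownames(mtcars), mtcars)

p_labels <- ggplot(cars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_text(aes(label = ifelse(mpg > 30, model, NA_character_)),
              nudge_y = 1,
              na.rm = TRUE) # silently drop the rows without a label

# Number of points that end up with a label
labels_used <- layer_data(p_labels, 2)$label
n_labelled <- sum(!is.na(labels_used))
n_labelled
```

For crowded plots, overlapping labels quickly become a problem; extension packages from the list in the resources exist for that, but they are beyond the scope of this sketch.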

4. Special cases

4.1 Network visualization

4.2 Maps